
extend indexing for apache #892

Open
chinyeungli wants to merge 16 commits into main from 631_extend_indexing_for_apache


@chinyeungli chinyeungli commented Jun 29, 2026

Contributor

Reference: #631

I implemented the PURL logic based on package-url/purl-spec#834 (comment)

In short, the pipeline gets the file list from https://archive.apache.org/dist/zzz/find-ls2.txt.gz , filters it to keep only the files and paths we care about (archive-type files), and then loads the package metadata from https://projects.apache.org/json/foundation/projects.json
It then combines the two, constructs the PURLs, and passes them to "mine_and_publish_apache_packageurls" for mining/indexing.
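
The filtering step can be sketched roughly as below. This is only an illustration: the suffix list and the helper names `is_archive` and `filter_archive_paths` are assumptions, not the code in this PR, and the sketch works on plain paths rather than on the raw lines of find-ls2.txt.gz.

```python
from typing import Iterable, Iterator

# Assumed set of archive suffixes; the PR does not list the exact ones it keeps.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip", ".jar")


def is_archive(path: str) -> bool:
    """Return True if the path ends with one of the assumed archive suffixes."""
    return path.lower().endswith(ARCHIVE_SUFFIXES)


def filter_archive_paths(paths: Iterable[str]) -> Iterator[str]:
    """Yield only the paths that look like downloadable archives."""
    for path in paths:
        path = path.strip()
        if path and is_archive(path):
            yield path


# Sample paths: two archives, one signature file, and one KEYS file.
sample = [
    "accumulo/1.10.4/accumulo-1.10.4-bin.tar.gz",
    "accumulo/1.10.4/accumulo-1.10.4-bin.tar.gz.asc",
    "accumulo/KEYS",
    "accumulo/1.10.4/accumulo-1.10.4-src.tar.gz",
]
archives = list(filter_archive_paths(sample))
```

Signature and checksum files such as `.asc` are dropped here because they do not end with an archive suffix.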

Below are some of the printed "base", "purls", and "purls_and_package_data" values from the "_mine_and_publish_packageurls" function.

base: pkg:sid/apache.org/accumulo
purls: ['pkg:sid/apache.org/accumulo@1.10.4?file_name=accumulo-1.10.4-bin.tar.gz', 'pkg:sid/apache.org/accumulo@1.10.4?file_name=accumulo-1.10.4-src.tar.gz', 'pkg:sid/apache.org/accumulo@2.1.4?file_name=accumulo-2.1.4-bin.tar.gz', 'pkg:sid/apache.org/accumulo@2.1.4?file_name=accumulo-2.1.4-src.tar.gz', 'pkg:sid/apache.org/accumulo@3.0.0?file_name=accumulo-3.0.0-bin.tar.gz', 'pkg:sid/apache.org/accumulo@3.0.0?file_name=accumulo-3.0.0-src.tar.gz']
purls_and_package_data: [('pkg:sid/apache.org/accumulo@1.10.4?file_name=accumulo-1.10.4-bin.tar.gz', {'name': 'accumulo', 'version': '1.10.4', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/1.10.4/accumulo-1.10.4-bin.tar.gz', 'size': '24427395', 'release_date': '2023-11-16 21:12 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'}), ('pkg:sid/apache.org/accumulo@1.10.4?file_name=accumulo-1.10.4-src.tar.gz', {'name': 'accumulo', 'version': '1.10.4', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. 
It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/1.10.4/accumulo-1.10.4-src.tar.gz', 'size': '4140280', 'release_date': '2023-11-16 21:12 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'}), ('pkg:sid/apache.org/accumulo@2.1.4?file_name=accumulo-2.1.4-bin.tar.gz', {'name': 'accumulo', 'version': '2.1.4', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/2.1.4/accumulo-2.1.4-bin.tar.gz', 'size': '40703931', 'release_date': '2025-08-20 21:49 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'}), ('pkg:sid/apache.org/accumulo@2.1.4?file_name=accumulo-2.1.4-src.tar.gz', {'name': 'accumulo', 'version': '2.1.4', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. 
It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/2.1.4/accumulo-2.1.4-src.tar.gz', 'size': '4435977', 'release_date': '2025-08-20 21:49 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'}), ('pkg:sid/apache.org/accumulo@3.0.0?file_name=accumulo-3.0.0-bin.tar.gz', {'name': 'accumulo', 'version': '3.0.0', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/3.0.0/accumulo-3.0.0-bin.tar.gz', 'size': '35759308', 'release_date': '2023-08-21 21:52 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'}), ('pkg:sid/apache.org/accumulo@3.0.0?file_name=accumulo-3.0.0-src.tar.gz', {'name': 'accumulo', 'version': '3.0.0', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. 
It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/3.0.0/accumulo-3.0.0-src.tar.gz', 'size': '3633838', 'release_date': '2023-08-21 21:52 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'})]

base: pkg:sid/apache.org/accumulo/accumulo-access
purls: ['pkg:sid/apache.org/accumulo/accumulo-access@1.0.0-beta3?file_name=accumulo-access-1.0.0-beta3-source-release.tar.gz']
purls_and_package_data: [('pkg:sid/apache.org/accumulo/accumulo-access@1.0.0-beta3?file_name=accumulo-access-1.0.0-beta3-source-release.tar.gz', {'name': 'accumulo-access', 'version': '1.0.0-beta3', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/accumulo-access/1.0.0-beta3/accumulo-access-1.0.0-beta3-source-release.tar.gz', 'size': '62745', 'release_date': '2026-04-29 00:53 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'})]

base: pkg:sid/apache.org/accumulo/accumulo-classloader-extras
purls: ['pkg:sid/apache.org/accumulo/accumulo-classloader-extras@1.0.0?file_name=accumulo-classloader-extras-1.0.0-source-release.tar.gz']
purls_and_package_data: [('pkg:sid/apache.org/accumulo/accumulo-classloader-extras@1.0.0?file_name=accumulo-classloader-extras-1.0.0-source-release.tar.gz', {'name': 'accumulo-classloader-extras', 'version': '1.0.0', 'repository_homepage_url': 'https://accumulo.apache.org', 'repository_download_url': 'https://accumulo.apache.org/downloads', 'description': "The Apache Accumulo sorted, distributed key/value store is\n      based on Google's BigTable design. It is built on top of Apache Hadoop,\n      Apache Zookeeper, and Apache Thrift. It features a few novel improvements\n      on the BigTable design in the form of cell-level access labels and a\n      server-side programming mechanism that can modify key/value pairs at\n      various points in the data management process.", 'download_url': 'https://archive.apache.org/dist/accumulo/accumulo-classloader-extras/1.0.0/accumulo-classloader-extras-1.0.0-source-release.tar.gz', 'size': '60513', 'release_date': '2026-03-02 21:27 UTC', 'mailing_list': 'https://accumulo.apache.org/contact-us', 'programming_language': 'Java'})]

base is the versionless PURL of the package.
purls is the list of all PURLs collected/constructed for that base package.
purls_and_package_data is a list of (purl, collected metadata) tuples.
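
To make the relationship between the three values concrete, here is a small sketch that derives the base from a full PURL and groups the tuples by it. The helpers `base_purl` and `group_by_base` are hypothetical names for illustration, not functions from this PR, and the string handling assumes PURLs shaped like the ones printed above (no subpath, no "@" in the name).

```python
from collections import defaultdict


def base_purl(purl: str) -> str:
    """Drop the qualifiers and the version to get the versionless PURL."""
    without_qualifiers = purl.split("?", 1)[0]
    return without_qualifiers.split("@", 1)[0]


def group_by_base(purls_and_package_data):
    """Group (purl, metadata) tuples under their versionless base PURL."""
    grouped = defaultdict(list)
    for purl, package_data in purls_and_package_data:
        grouped[base_purl(purl)].append((purl, package_data))
    return dict(grouped)


# Metadata is trimmed to a single field to keep the example short.
purls_and_package_data = [
    (
        "pkg:sid/apache.org/accumulo@1.10.4?file_name=accumulo-1.10.4-bin.tar.gz",
        {"version": "1.10.4"},
    ),
    (
        "pkg:sid/apache.org/accumulo@2.1.4?file_name=accumulo-2.1.4-bin.tar.gz",
        {"version": "2.1.4"},
    ),
]
grouped = group_by_base(purls_and_package_data)
```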

Note that when the qualifier is file_name, the download URL follows the standard layout:

https://archive.apache.org/dist/{namespace}/{name}/{version}/{file_name}

If the actual download URL does not follow this layout (for example, the version does not start with a digit, or the URL contains special segments such as "sources" or "binaries"), the download URL cannot be reconstructed from the PURL alone, so the PURL uses a download_url qualifier instead.
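
A rough sketch of that rule, for illustration only. `choose_qualifier` and `reconstruct_download_url` are hypothetical helpers, not the functions in this PR. The sketch assumes the path after `pkg:sid/apache.org/` maps directly onto the path under `/dist/`, as in the printed examples above, and the two irregular URLs in the usage lines are made up.

```python
from urllib.parse import unquote

ARCHIVE_ROOT = "https://archive.apache.org/dist/"
PURL_PREFIX = "pkg:sid/apache.org/"
# Segments named in the description as breaking the standard layout.
SPECIAL_SEGMENTS = {"sources", "binaries"}


def choose_qualifier(download_url: str) -> tuple:
    """Return ("file_name", name) for a standard layout, else ("download_url", url)."""
    if not download_url.startswith(ARCHIVE_ROOT):
        raise ValueError(f"not an archive.apache.org/dist URL: {download_url}")
    segments = download_url[len(ARCHIVE_ROOT):].split("/")
    standard = (
        len(segments) >= 3                    # project path, version, file name
        and segments[-2][:1].isdigit()        # version must start with a digit
        and not SPECIAL_SEGMENTS.intersection(segments)
    )
    if standard:
        return ("file_name", segments[-1])
    return ("download_url", download_url)


def reconstruct_download_url(purl: str) -> str:
    """Rebuild the download URL from a PURL carrying either qualifier."""
    body, _, query = purl.partition("?")
    qualifiers = {}
    for item in query.split("&"):
        if "=" in item:
            key, value = item.split("=", 1)
            qualifiers[key] = unquote(value)
    if "download_url" in qualifiers:
        return qualifiers["download_url"]
    path, _, version = body.partition("@")
    project_path = path[len(PURL_PREFIX):]
    return f"{ARCHIVE_ROOT}{project_path}/{version}/{qualifiers['file_name']}"


standard = choose_qualifier(
    "https://archive.apache.org/dist/accumulo/1.10.4/accumulo-1.10.4-bin.tar.gz"
)
# Hypothetical irregular URLs, made up for this example.
with_binaries = choose_qualifier(
    "https://archive.apache.org/dist/example/binaries/1.0/example-1.0.zip"
)
non_digit_version = choose_qualifier(
    "https://archive.apache.org/dist/example/v1.0/example-1.0.zip"
)
rebuilt = reconstruct_download_url(
    "pkg:sid/apache.org/accumulo/accumulo-access@1.0.0-beta3"
    "?file_name=accumulo-access-1.0.0-beta3-source-release.tar.gz"
)
```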

Since apache.org doesn’t provide an index file, we can’t directly identify newly added packages that need to be indexed. Storing every mined archive path in a checkpoint file would eventually make the file too large to process across the entire Apache repo. Instead of persisting all mined_packages, we use timestamp comparison to detect which packages are new and require indexing.
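
As an illustration of the timestamp comparison, the sketch below keeps only entries whose release date is newer than a stored checkpoint time. It assumes the comparison is against the time of the last indexing run and that dates look like the `release_date` values printed above. `parse_release_date` and `select_new_entries` are hypothetical names, not the code in this PR.

```python
from datetime import datetime, timezone


def parse_release_date(value: str) -> datetime:
    """Parse a date such as '2023-11-16 21:12 UTC' into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M UTC")
    return parsed.replace(tzinfo=timezone.utc)


def select_new_entries(entries, last_indexed: datetime):
    """Keep only entries released after the last indexing run."""
    return [
        entry
        for entry in entries
        if parse_release_date(entry["release_date"]) > last_indexed
    ]


entries = [
    {"version": "1.10.4", "release_date": "2023-11-16 21:12 UTC"},
    {"version": "2.1.4", "release_date": "2025-08-20 21:49 UTC"},
]
# Assumed checkpoint: the time of the previous run.
last_indexed = datetime(2024, 1, 1, tzinfo=timezone.utc)
new_entries = select_new_entries(entries, last_indexed)
```

With this approach the checkpoint only needs to store one timestamp rather than every mined path.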

JonoYang and others added 12 commits June 26, 2026 17:56
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
…e as similar as the debian.py #637

Signed-off-by: Chin Yeung Li <tli@nexb.com>
 - Constructing purls based on package-url/purl-spec#834 (comment)

Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
…npm.py #631

Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
 * Only use timestamp to determine what packages need to be indexed

Signed-off-by: Chin Yeung Li <tli@nexb.com>